This notebook does topic modelling via BERTopic on the Kaggle dataset and the LIAR dataset.
We follow https://www.youtube.com/watch?v=v3SePt3fr9g. Note that we do not remove stop words in the BERTopic approach. Topics can be merged after fitting: https://youtu.be/uZxQz87lb84?t=1002
Note that by default, BERTopic uses sentence-transformers/all-MiniLM-L6-v2 to embed documents into a dense $384$-dimensional vector space, and then reduces the dimensionality with UMAP (the default). PCA or truncated SVD can be used instead, or the reduction step can be skipped.
from bertopic import BERTopic
import pandas as pd
# Read in Kaggle titles only (column 0 of each CSV). True and fake articles are loaded into separate single-column dataframes.
kaggle_df_true = pd.read_csv('./kaggle_dataset/True.csv', usecols = [0])
kaggle_df_fake = pd.read_csv('./kaggle_dataset/Fake.csv', usecols = [0])
corpus_true = kaggle_df_true.loc[:,'title'].tolist()
corpus_fake = kaggle_df_fake.loc[:,'title'].tolist()
topic_model_true = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics_true, probs_true = topic_model_true.fit_transform(corpus_true)
topic_model_fake = BERTopic(embedding_model="all-MiniLM-L6-v2")
topics_fake, probs_fake = topic_model_fake.fit_transform(corpus_fake)
topic_model_true.get_topic_info()
| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5998 | -1_korea_trump_north_white |
| 1 | 0 | 461 | 0_tax_reform_corporate_rate |
| 2 | 1 | 419 | 1_brexit_uk_eu_britain |
| 3 | 2 | 343 | 2_china_xi_chinese_graft |
| 4 | 3 | 340 | 3_iran_nuclear_deal_sanctions |
| ... | ... | ... | ... |
| 342 | 341 | 10 | 341_malaysia_beer_festival_suspected |
| 343 | 342 | 10 | 342_deportation_deportations_raids_reprieves |
| 344 | 343 | 10 | 343_smuggling_migrants_brave_macedonian |
| 345 | 344 | 10 | 344_secretary_army_fanning_nominate |
| 346 | 345 | 10 | 345_felons_virginia_voting_restoring |
347 rows × 3 columns
topic_model_fake.get_topic_info()
| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 8173 | -1_trump_in_to_video |
| 1 | 0 | 535 | 0_melania_women_ivanka_sexual |
| 2 | 1 | 489 | 1_students_college_school_student |
| 3 | 2 | 381 | 2_obama_barack_president_speech |
| 4 | 3 | 294 | 3_black_racist_supremacists_white |
| ... | ... | ... | ... |
| 414 | 413 | 10 | 413_obstruction_niece_fired_grassroots |
| 415 | 414 | 10 | 414_rioter_want_protester_priceless |
| 416 | 415 | 10 | 415_100_interviewer_express_disgust |
| 417 | 416 | 10 | 416_cover_globe_boston_runs |
| 418 | 417 | 10 | 417_sue_amendment_suing_tweet |
419 rows × 3 columns
According to the BERTopic documentation, outliers (documents assigned to topic -1) can be reassigned to the existing topics as follows:
After training our BERTopic model, we can assign outliers to topics by making use of the .reduce_outliers function in BERTopic. An advantage of using this approach is that there are four built-in strategies one can choose for reducing outliers. Moreover, this technique allows the user to experiment with reducing outliers across a number of strategies and parameters without actually having to re-train the topic model each time. The following is a minimal example of how to use this function:
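# Reduce outliers: reassign documents labelled -1 to one of the existing topics.
# The call returns a new list of topic labels, one per document in corpus_true.
new_topics_true = topic_model_true.reduce_outliers(corpus_true, topics_true)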
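To see how much a call like the one above changes the assignment, one can compare the share of documents labelled -1 before and after. The following is a minimal sketch in plain Python. `outlier_share` is a hypothetical helper, not part of BERTopic, and the label lists are made-up stand-ins for `topics_true` and `new_topics_true`.

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents labelled -1 (the outlier label)."""
    if not topics:
        return 0.0
    return Counter(topics)[-1] / len(topics)


# Made-up labels for illustration only; in the notebook these would be
# topics_true (before) and new_topics_true (after reduce_outliers).
before = [-1, 0, 1, -1, 2, -1, 0, 1]
after = [0, 0, 1, 2, 2, -1, 0, 1]

print(f"outliers before: {outlier_share(before):.1%}")
print(f"outliers after:  {outlier_share(after):.1%}")
```

Applied to the real label lists, this gives a quick check on whether the chosen strategy left any documents unassigned.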
topic_model_true.visualize_barchart(width=180, height=400, top_n_topics=10, n_words=10)
topic_model_fake.visualize_barchart(width=180, height=400, top_n_topics=10, n_words=10)
# Pair each title with its topic label, then keep only topic 4:
topic_true_df = pd.DataFrame({"topic": topics_true, "document": corpus_true})
topic_true = topic_true_df[topic_true_df.topic == 4]
for i in range(16):
    print(topic_true['document'].values[i])
U.S. calls Myanmar moves against Rohingya 'ethnic cleansing'
U.S. hopes to pressure Myanmar to permit Rohingya repatriation
U.S. Congress members decry 'ethnic cleansing' in Myanmar; Suu Kyi doubts allegations
Myanmar operation against Rohingya has 'hallmarks of ethnic cleansing', U.S. Congress members say
U.S. lawmakers target Myanmar military with new sanctions
Tillerson tells Myanmar army chief U.S. concerned about reported atrocities
U.S. weighs calling Myanmar's Rohingya crisis 'ethnic cleansing'
U.S. officials will not label treatment of Rohingya as 'ethnic cleansing'
U.S. says holds Myanmar military leaders accountable in Rohingya crisis
Lawmakers urge U.S. to craft targeted sanctions on Myanmar military
Senators urge Trump administration to act on Myanmar Rohingya
Exclusive: Overruling diplomats, U.S. to drop Iraq, Myanmar from child soldiers' list
Obama announces lifting of U.S. sanctions on Myanmar
Exclusive: U.S. to renew most Myanmar sanctions with changes to aid business
Senate unanimously approves Myanmar ambassador nominee
Senate panel approves Myanmar nominee
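The filter-and-print step above can also be written as a small function that takes the topic id as a parameter and never indexes past the end of a short topic. This is a sketch in plain Python; `titles_for_topic` is a hypothetical helper, and the toy lists stand in for `topics_true` and `corpus_true`.

```python
def titles_for_topic(topics, documents, topic_id, limit=16):
    """Return up to `limit` documents whose assigned topic equals `topic_id`."""
    matches = [doc for topic, doc in zip(topics, documents) if topic == topic_id]
    return matches[:limit]


# Toy stand-ins for the topic labels and titles used in the notebook.
toy_topics = [4, 0, 4, -1, 4]
toy_docs = ["title a", "title b", "title c", "title d", "title e"]

for title in titles_for_topic(toy_topics, toy_docs, topic_id=4):
    print(title)
```

Passing the topic id explicitly keeps the inspected topic and the accompanying comment from drifting apart.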
# Pair each title with its topic label, then keep only topic 4:
topic_fake_df = pd.DataFrame({"topic": topics_fake, "document": corpus_fake})
topic_fake = topic_fake_df[topic_fake_df.topic == 4]
for i in range(16):
    print(topic_fake['document'].values[i])
Bad News For Trump — Mitch McConnell Says No To Repealing Obamacare In 2018
Maine Voters Tell Trump To Go F*ck Himself, Expand Medicaid Through Obamacare
The Numbers Are In: States, Insurers Literally Say Obamacare Trainwreck Is TRUMP’S Fault
Trump’s Press Secretary Falls Apart, Exposes His Lie About Obamacare Vote (VIDEO)
Trumpcare Is Officially Dead, Senator Collins Confirms She’s Voting No
WATCH: GOP Senator Yawns As Disabled Healthcare Protesters Are Being Dragged Away By Cops
Trump’s Making It Harder To Sign Up For Obamacare On Purpose, Even If The GOP Doesn’t Pass Anything
Republican Senator STUNS In Town Hall, Admits GOP’s ObamaCare Repeal Will Fail (TWEETS)
Medicaid Directors Of All 50 States Issue Joint Statement Slamming GOP Health Bill
Trump Regrets, Move Over: ‘Sassy Gay Republican’ Is All Of The Healthcare Angst We Need Right Now
Obama Just Made A VERY Powerful Statement About Trump’s Attempts To Repeal Obamacare (VIDEO)
Trump Vows To Save America From ‘Curse’ Of Functional Health Care System
Fed Up With Congress, Trump Just Put A Big Nail In Obamacare’s Coffin
Want To Ride On Air Force One As A Senator? Then Be Prepared To Vote For Trump’s Health Care Bill
Trump Is Raising Your Healthcare Premiums, And That’s A Fact
Republican Senator Predicts Trump’s Next Big Legislative Push Will Fail Just Like Trumpcare
From the above, we see that titles from the Boiler Room podcast (conspiracy theories?) should be removed during preprocessing! See https://alternatecurrentradio.com/category/boiler-room/, https://www.youtube.com/@AlternateCurrentRadio/featured, https://alternatecurrentradio.com/voodoo-nipple-calculus/, https://alternatecurrentradio.com/world-war-mrna/
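One way to drop such titles before fitting is a case-insensitive substring filter over the title list. This is a sketch under assumptions: `drop_matching_titles` is a hypothetical helper, the phrase "boiler room" is a guess at how these episodes are labelled in the titles, and the sample titles are invented rather than taken from the dataset.

```python
def drop_matching_titles(titles, phrases=("boiler room",)):
    """Keep only titles that contain none of `phrases` (case-insensitive)."""
    lowered = [phrase.lower() for phrase in phrases]
    return [
        title
        for title in titles
        if not any(phrase in title.lower() for phrase in lowered)
    ]


# Invented example titles, for illustration only.
sample_titles = [
    "BOILER ROOM: Episode placeholder title",
    "Senate panel approves nominee",
    "Boiler Room EP placeholder",
    "Lawmakers urge new sanctions",
]

print(drop_matching_titles(sample_titles))
```

In the notebook, the filter would be applied to `corpus_fake` before calling `fit_transform`. The phrase list should be checked against the actual titles first.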